Textual Features for Programming by Example 



Aditya Krishna Menon 
University of California, San Diego, 
9500 Gilman Drive, La Jolla CA 92093 



akmenon@ucsd.edu 



Omer Tamuz 
Faculty of Mathematics and Computer Science, 
The Weizmann Institute of Science, Rehovot Israel 
omer.tamuz@weizmann.ac.il 

Sumit Gulwani 
Microsoft Research, 
One Microsoft Way, Redmond, WA 98052 



Butler Lampson 
Microsoft Research, 
One Memorial Drive, Cambridge MA 02142 

butler.lampson@microsoft.com 

Adam Tauman Kalai 
Microsoft Research, 
One Memorial Drive, Cambridge MA 02142 
adum@microsoft.com 



September 19, 2012 

Abstract 

In Programming by Example, a system attempts to infer a program from input and output 
examples, generally by searching for a composition of certain base functions. Performing 
a naive brute force search is infeasible for even mildly involved tasks. We note that the 
examples themselves often present clues as to which functions to compose, and how to rank 
the resulting programs. In text processing, which is our domain of interest, clues arise 
from simple textual features: for example, if parts of the input and output strings are 
permutations of one another, this suggests that sorting may be useful. We describe a 
system that learns the reliability of such clues, allowing for faster search and a 
principled ranking over programs. Experiments on a prototype of this system show that 
this learning scheme facilitates efficient inference on a range of text processing tasks. 



1 Introduction 



Programming by Example (PBE) [10, 1] is an attractive means for end-user programming 
tasks, wherein the user provides the machine examples of a task she wishes to perform, 
and the machine infers a program to accomplish this. This paradigm has been used in a 
wide variety of domains; [?] gives a recent overview. We focus on text processing, a 
problem most computer users face (be it reformatting the contents of an email or 
extracting data from a log file), and for which several complete PBE systems have been 
designed, including LAPIS [11], SMARTedit [?], QuickCode [3, 12], and others [?]. Such 
systems aim to provide a simpler alternative to the traditional solutions to the problem, 
which involve either tedious manual editing, or esoteric computing skills such as 
knowledge of awk or emacs. 

A fundamental challenge in PBE is the following inference problem: given a set of base 
functions, how does one quickly search for programs composed of these functions that are 
consistent with the user-provided examples? One way is to make specific assumptions about 
the nature of the base functions, as is done by many existing PBE systems [?], but this 
is unsatisfying because it restricts the range of tasks a user can perform. The natural 
alternative, brute force search, is infeasible for even mildly involved programs. Thus, a 
basic question is whether there is a solution possessing both generality and efficiency. 

This paper aims to take a step towards an affirmative answer to this question. We observe 
that there are often telling features in the user examples suggesting which functions are 
likely. For example, suppose that a user demonstrates their intended task through one or 
more input-output pairs of strings $\{(x_i, y_i)\}$, where each $y_i$ is a permutation of 
$x_i$. This feature provides a clue that when the system is searching for the $f(\cdot)$ 
such that $f(x_i) = y_i$, sorting functions may be useful. Our strategy is to incorporate 
a library of such clues, each suggesting relevant functions based on textual features of 
the input-output pairs. We learn weights telling us the reliability of each clue, and use 
this to bias program search. This bias allows for significantly faster inference compared 
to brute force search. Experiments on a prototype system demonstrate the effectiveness of 
feature-based learning. 

To clarify matters, we step through a concrete example of our system's operation. 

1.1 Example of our system's operation 

Imagine a user has a long list of names with some repeated entries (say, the Oscar 
winners for Best Actor), and would like to create a list of the unique names, each anno- 
tated with their number of occurrences. Following the PBE paradigm, in our system, 
the user illustrates the operation by providing an example, which is an input-output pair 
of strings. Figure 1 shows one possible such pair, which uses a subset of the full list (in 
particular, the winners from '91-'95) the user possesses. 

    Input:                      Output: 
    Anthony Hopkins             Anthony Hopkins (1) 
    Al Pacino                   Al Pacino (1) 
    Tom Hanks                   Tom Hanks (2) 
    Tom Hanks                   Nicolas Cage (1) 
    Nicolas Cage 

Figure 1: Input-output example for the desired task. 

One way to perform the above transformation is to first generate an intermediate list 
where each element of the input list is appended with its occurrence count, which would 
look like ["Anthony Hopkins (1)", "Al Pacino (1)", "Tom Hanks (2)", "Tom Hanks (2)", 
"Nicolas Cage (1)"], and then remove duplicates. The corresponding program $f(\cdot)$ may 
be expressed as the composition 

    f(x) = dedup(concatLists(x, " ", concatLists("(", count(x, x), ")"))). 

The argument x here represents the list of input lines that the user wishes to process, 
which may be much larger than the input provided in the example. We assume here 
a base language comprising (among others) a function dedup that removes duplicates 
from a list, concatLists that concatenates lists of strings elementwise, implicitly 
expanding singleton arguments, and count that finds the number of occurrences of the 
elements of one list in another. 
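
To make the composition concrete, the following is a minimal Python sketch of the three base functions and of $f$. It is an illustration only: the prototype described later is written in JavaScript, and the function bodies, the snake_case name `concat_lists`, and the choice to return counts as strings are assumptions inferred from the one-line descriptions above.

```python
from typing import List, Union

Arg = Union[str, List[str]]


def dedup(items: List[str]) -> List[str]:
    """Remove duplicates, keeping the first occurrence of each element."""
    seen = set()
    out = []
    for item in items:
        if item not in seen:
            seen.add(item)
            out.append(item)
    return out


def count(needles: List[str], haystack: List[str]) -> List[str]:
    """For each element of `needles`, its number of occurrences in `haystack`."""
    return [str(haystack.count(n)) for n in needles]


def concat_lists(*args: Arg) -> List[str]:
    """Concatenate lists of strings elementwise, expanding singleton arguments."""
    as_lists = [a if isinstance(a, list) else [a] for a in args]
    n = max(len(a) for a in as_lists)
    # A bare string or a one-element list is repeated to match the longest list.
    expanded = [a * n if len(a) == 1 else a for a in as_lists]
    return ["".join(parts) for parts in zip(*expanded)]


def f(x: List[str]) -> List[str]:
    """f(x) = dedup(concatLists(x, " ", concatLists("(", count(x, x), ")")))."""
    return dedup(concat_lists(x, " ", concat_lists("(", count(x, x), ")")))
```

Applied to the five input lines of Figure 1, `f` returns the four annotated lines of the output; the same function can then be run on the full list.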

While simple, this example is out of scope for existing text processing PBE systems. Most 
systems support a restricted, pre-defined set of functions that do not include natural 
tasks like removing duplicates; for example [3] only supports functions that operate on a 
line-by-line basis. These systems perform inference with search routines that are 
hand-coded for their supported functionality, and are thus not easily extensible. (Even 
if an exception could be made for specific examples like the one above, there are 
countless other text processing applications we would like to solve.) Systems with richer 
functionality are inapplicable because they perform inference with brute force search (or 
a similarly intractable operation [5]) over all possible function compositions. Such a 
naive search over even a moderate sized library of base functions is unlikely to find the 
complex composition of our example. Therefore, a more generic framework is needed. 

Our basic observation is that certain textual features can help bias our search by 
providing clues about which functions may be relevant: in particular, (a) there are 
duplicate lines in the input but not output, suggesting that dedup may be useful, (b) 
there are parentheses in the output but not input, suggesting the function 
concatLists("(", L, ")") for some list L, (c) there are numbers on each line of the 
output but none in the input, suggesting that count may be useful, and (d) there are many 
more spaces in the output than the input, suggesting that " " may be useful. Our claim is 
that by learning weights that tell us the reliability of these clues (for example, how 
confident can we be that duplicates in the input but not the output suggests dedup) we 
can significantly speed up the inference process over brute force search. 

In more detail, a clue is a function that generates rules in a probabilistic context 
free grammar based on features of the provided example. Each rule corresponds to a 
function^1 (possibly with bound arguments) or constant in the underlying programming 
language. The rule probabilities are computed from weights on the clues that generate 
them, which in turn are learned from a training corpus of input-output examples. To learn 
$f(\cdot)$, we now search through derivations of this grammar in order of decreasing 
probability. Table 1 illustrates what the grammar may look like for the above example. 
Note that the grammar rules and probabilities are example specific; we do not include a 
rule such as DELIM → "$", say, because there is no instance of "$" in the input or 
output. Further, compositions of rules may also be generated, such as 
concatList("(", LIST, ")"). 

^1 When we describe clues as suggesting functions, we implicitly mean the corresponding 
grammar rule. 



    Production                           Probability    Production       Probability 
    -----------------------------------  -----------    ---------------  ----------- 
    P → join(LIST, DELIM)                1              CAT → LIST       0.7 
    LIST → split(X, DELIM)               0.3            CAT → DELIM      0.3 
    LIST → concatList(CAT, CAT, CAT)     0.1            DELIM → "\n"     0.5 
    LIST → concatList("(", CAT, ")")     0.2            DELIM → " "      0.3 
    LIST → dedup(LIST)                   0.2            DELIM → "("      0.1 
    LIST → count(LIST, LIST)             0.2            DELIM → ")"      0.1 

Table 1: Example of grammar rules generated for task in Figure 1. 

Table 1 is of course a condensed view of the actual grammar our prototype system 
generates, which is based on a large library of about 100 features and clues. With the 
full grammar, a naive brute force search over compositions takes 30 seconds to find the 
right solution to the example of Figure 1, whereas with learning the search terminates 
in just 0.5 seconds. 

1.2 Contributions 

To the best of our knowledge, ours is the first PBE system to exploit textual features 
for inference, which we believe is a step towards achieving the desiderata of efficiency 
and generality. The former will be demonstrated in an empirical evaluation of our 
learning scheme. For the latter, while the learning component is discussed in the con- 
text of text processing, the approach could possibly be adapted for different domains. 
Further, the resulting system is highly extensible. Through the use of clues, one only 
considers broadly relevant functions during the search for a suitable program: one is 
free to add functionality to process addresses, e.g. , without fear of it adversely affect- 
ing the performance of processing dates. Through the use of learning, we further sift 
amongst these broadly relevant functions, and determine which of them is likely to be 
useful in explaining the given data. A system designer need only write clues for any 
new functionality, and add relevant examples to the training corpus. Our system then 
automatically learns weights associated with these clues. 

1.3 Comparison to previous learning systems 

Most previous PBE systems for text processing handle a relatively small subset of natural 
text processing tasks. This is in order to admit efficient representation and search over 
consistent programs, e.g. using a version space [?], thus sidestepping the issue of 
searching for programs using general classes of functions. To our knowledge, every system 
designed for a library of arbitrary functions searches for appropriate compositions of 
functions either by brute force search, or a similarly intractable operation such as 
invoking a SAT solver [5].^2 Our learning approach based on textual features is thus more 
general and flexible than previous approaches. 

Having said this, our goal in this paper is not to compete with existing PBE systems 
in terms of functionality. Instead, we wish to show that the fundamental PBE inference 
problem may be attacked by learning with textual features. This idea could in fact 
be applied in conjunction with prior systems. A specific feature of the data, such as 
the input and output having the same number of lines, may be a clue that a function 
corresponding to a system like QuickCode [3] will be useful. 

2 Formalism of our approach 

We begin a formal discussion of our approach by defining the learning problem in PBE. 

2.1 Programming by example (PBE) 

Let $S$ denote the set of strings. At inference time, the user provides a system input 
$z := (\bar{x}, x, y) \in S^3$, where $\bar{x}$ represents the data to be processed, and 
$(x, y)$ is the example input-output pair that represents the string transformation the 
user wishes to perform. In the example of the previous section, $(x, y)$ is the pair of 
strings represented in Figure 1, and $\bar{x}$ is the list of all Oscar winners. While a 
typical choice for $x$ is some prefix of $\bar{x}$, this is not required in general.^3 We 
assume that $y = f(x)$, for some unknown target function or program $f \in S^S$, from the 
set of functions that map strings to strings. Our goal is to recover $f(\cdot)$. 

We do so by defining a probability model $\Pr[f \mid z; \theta]$ over programs, 
parameterized by some $\theta$. Given some $\theta$, at inference time on input $z$, we 
pick the most likely program under $\Pr[f \mid z; \theta]$ which is also consistent with 
$z$. We do so by invoking a search function $\sigma_{\theta,\tau} : S^3 \to S^S$ that 
depends on $\theta$ and an upper bound $\tau$ on search time. This produces our 
conjectured program $\hat{f} = \sigma_{\theta,\tau}(z)$ computing a string-to-string 
transformation, or a trivial failure function $\bot$ if the search fails in the allotted 
time. 

The $\theta$ parameters are learned at training time, where the system is given a corpus 
of $T$ training quadruples, $\{(z^{(t)}, \bar{y}^{(t)})\}_{t=1}^{T}$, with 
$z^{(t)} = (\bar{x}^{(t)}, x^{(t)}, y^{(t)}) \in S^3$ representing the actual data and the 
example input-output pair, and $\bar{y}^{(t)} \in S$ the correct output on 
$\bar{x}^{(t)}$. Note that each quadruple here represents a different task; for example, 
one may represent the Oscar winners example of the previous section, another a generic 
email processing task, and so on. From these examples, the system chooses the parameters 
$\theta$ that maximize the likelihood $\Pr[f \mid z; \theta]$. We now describe how we 
model the conditional distribution $\Pr[f \mid z; \theta]$ using a probabilistic 
context-free grammar. 

^2 One could consider employing heuristic search techniques such as Genetic Programming. 
However, this requires picking a metric that captures meaningful search progress. This is 
difficult, since functions like sorting cause drastic changes on an input. Thus, standard 
metrics like edit distance may not be appropriate. 

^3 This is more general than the setup of e.g. [?], which assumes $x$ and $y$ have the 
same number of lines, each of which is treated as a separate example. 

2.2 PCFGs for programs 



We maintain a probability distribution over programs with a Probabilistic Context-Free 
Grammar (PCFG) $\mathcal{G}$, as discussed in [9]. The grammar is defined by a set of 
non-terminal symbols $\mathcal{V}$, terminal symbols $\Sigma$ (which may include strings 
$s \in S$ and also other program-specific objects such as lists or functions), and rules 
$\mathcal{R}$. Each rule $r \in \mathcal{R}$ has an associated probability 
$\Pr[r \mid z; \theta]$ of being generated given the system input $z$, where $\theta$ 
represents the unobserved parameters of the grammar. WLOG, each rule $r$ is also 
associated with a function $f_r : \Sigma^{\mathrm{NArgs}(r)} \to \Sigma$, where 
$\mathrm{NArgs}(r)$ denotes the number of arguments in the RHS of rule $r$. A program^4 is 
a derivation of the start symbol $V_{\mathrm{start}}$. The probability of any program 
$f(\cdot)$ is the probability of its constituent rules $\mathcal{R}_f$ (counting 
repetitions): 

    $$\Pr[f \mid z; \theta] = \Pr[\mathcal{R}_f \mid z; \theta] = \prod_{r \in \mathcal{R}_f} \Pr[r \mid z; \theta]. \qquad (1)$$

We now describe how the distribution $\Pr[r \mid z; \theta]$ is parameterized using clues. 
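
As a small worked instance of equation (1), the sketch below multiplies rule probabilities along one derivation, counting repetitions. The probabilities are copied from Table 1; the particular derivation, join(dedup(split(X, "\n")), "\n"), and the helper name `program_probability` are chosen purely for illustration.

```python
import math
from typing import Dict, List

# A few rule probabilities copied from Table 1 (keyed by a textual form of the rule).
RULE_PROB: Dict[str, float] = {
    "P -> join(LIST, DELIM)": 1.0,
    "LIST -> dedup(LIST)": 0.2,
    "LIST -> split(X, DELIM)": 0.3,
    'DELIM -> "\\n"': 0.5,
}


def program_probability(rules: List[str], rule_prob: Dict[str, float]) -> float:
    """Equation (1): product of rule probabilities, with repeated rules counted."""
    log_p = 0.0
    for rule in rules:
        p = rule_prob.get(rule, 0.0)
        if p == 0.0:
            return 0.0  # a rule not in the grammar rules the program out
        log_p += math.log(p)
    return math.exp(log_p)


# Derivation of join(dedup(split(X, "\n")), "\n"); the DELIM rule is used twice.
DERIVATION = [
    "P -> join(LIST, DELIM)",
    "LIST -> dedup(LIST)",
    "LIST -> split(X, DELIM)",
    'DELIM -> "\\n"',
    'DELIM -> "\\n"',
]
```

Here `program_probability(DERIVATION, RULE_PROB)` evaluates to 1 × 0.2 × 0.3 × 0.5 × 0.5 = 0.015, up to floating-point rounding.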

2.3 Features and clues for learning 

The learning process exploits the following simple fact: the chance of a rule being part 
of an explanation for a string pair $(x, y)$ depends greatly on certain characteristics in 
the structure of $x$ and $y$. For example, one interesting binary feature is whether or 
not every line of $y$ is a substring of $x$. If true, it may suggest that the 
select_field rule should receive higher probability in the PCFG, and hence will be 
combined with other rules more often in the search. Another binary feature indicates 
whether or not "Massachusetts" occurs repeatedly as a substring in $y$ but not in $x$. 
This suggests that a rule generating the string "Massachusetts" may be useful. 
Conceptually, given a training corpus, we would like to learn the relationship between 
such features and the successful rules. However, there are an infinitude of such binary 
features as well as rules (e.g. a feature and rule corresponding to every possible 
constant string), but of course limited data and computational resources. So, we need a 
mechanism to estimate the relationship between the two entities. 

We connect features with rules via clues. A clue is a function 
$c : S^3 \to 2^{\mathcal{R}}$ that states, for each system input $z$, which subset of 
rules in $\mathcal{R}$ (the infinite set of grammar rules) may be relevant. This set of 
rules will be based on certain features of $z$, meaning that we search over compositions 
of instance-specific rules.^5 For example, one clue might return 
{E → select_field(E, Delim, Int)} if each line of $y$ is a substring of $x$, and 
$\emptyset$ otherwise. Another clue might recognize the input string is a permutation of 
the output string, and generate rules {E → sort(E, COMP), E → reverseSort(E, COMP), 
COMP → alphaComp, ...}, i.e., rules for sorting as well as introducing a nonterminal 
along with corresponding rules for various comparison functions. Note that a single clue 
can suggest a multitude of rules for different $z$'s (e.g. E → s for every substring s in 
the input), and "common" functions (e.g. concatenation of strings) may be suggested by 
multiple clues. 

^4 Two programs from different derivations may compute exactly the same function 
$f : S \to S$. However, determining whether two programs compute the same function is 
undecidable in general. Hence, we abuse notation and consider these to be different 
functions. 

^5 As long as the functions generated by our clues library include a Turing-complete 
subset, the class of functions being searched amongst is always the Turing-computable 
functions, though having a good bias is probably more useful than being Turing complete. 
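
The clue formalism can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the prototype's clue library: rules are represented as plain strings, the system input as a tuple (x_bar, x, y), "permutation" is interpreted at the level of lines, and the names `dedup_clue`, `sort_clue`, `select_field_clue` and `instance_rules` are hypothetical.

```python
from typing import Callable, List, Set, Tuple

SystemInput = Tuple[str, str, str]  # (x_bar, x, y)
Clue = Callable[[SystemInput], Set[str]]


def dedup_clue(z: SystemInput) -> Set[str]:
    """Suggest dedup if the input has duplicate lines but the output has none."""
    _, x, y = z
    x_lines, y_lines = x.split("\n"), y.split("\n")
    dups_in = len(set(x_lines)) < len(x_lines)
    dups_out = len(set(y_lines)) < len(y_lines)
    return {"LIST -> dedup(LIST)"} if dups_in and not dups_out else set()


def sort_clue(z: SystemInput) -> Set[str]:
    """Suggest sorting rules if the output lines are a reordering of the input lines."""
    _, x, y = z
    if x != y and sorted(x.split("\n")) == sorted(y.split("\n")):
        return {"E -> sort(E, COMP)", "E -> reverseSort(E, COMP)", "COMP -> alphaComp"}
    return set()


def select_field_clue(z: SystemInput) -> Set[str]:
    """Suggest select_field if every output line is a substring of the input."""
    _, x, y = z
    if all(line in x for line in y.split("\n")):
        return {"E -> select_field(E, Delim, Int)"}
    return set()


def instance_rules(z: SystemInput, clues: List[Clue]) -> Set[str]:
    """Union of the rules suggested by all clues for this particular input z."""
    rules: Set[str] = set()
    for clue in clues:
        rules |= clue(z)
    return rules
```

Each clue returns the empty set when its feature is absent, so the grammar searched for a given $z$ contains only rules that some clue found relevant.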

We now describe our probability model that is based on the clues formalism. 

2.4 Probability model 

Suppose the system has $n$ clues $c_1, c_2, \ldots, c_n$. For each clue $c_i$ we keep an 
associated parameter $\theta_i \in \mathbb{R}$. Let 
$\mathcal{R}_z = \bigcup_{i=1}^{n} c_i(z)$ be the set of instance-specific rules (wrt $z$) 
in the grammar. While the set of all rules $\mathcal{R}$ will be infinite in general, we 
assume there are a finite number of clues suggesting a finite number of rules, so that 
$\mathcal{R}_z$ is finite. For each rule $r \notin \mathcal{R}_z$, we take 
$\Pr[r \mid z] = 0$, i.e. a rule that is not suggested by any clue is disregarded. For 
each rule $r \in \mathcal{R}_z$, we use the probability model 

    $$\Pr[r \mid z; \theta] = \frac{1}{Z_{\mathrm{LHS}(r)}} \exp\Big( \sum_{i \,:\, r \in c_i(z)} \theta_i \Big), \qquad (2)$$

where $\mathrm{LHS}(r)$ denotes the nonterminal on the left-hand side of rule $r$, and for 
each nonterminal $V$, the normalizer $Z_V$ ensures we get a valid probability 
distribution: 

    $$Z_V = \sum_{r \in \mathcal{R}_z \,:\, \mathrm{LHS}(r) = V} \exp\Big( \sum_{i \,:\, r \in c_i(z)} \theta_i \Big).$$

This is a log-linear model for the probabilities, where each clue has a weight 
$e^{\theta_i}$, which is intuitively its reliability, and the probability of each rule is 
proportional to the product of the weights generating that rule. An alternative would be 
to make the probabilities be the (normalized) sums of corresponding weights, but we favor 
products for two reasons. First, as described shortly, maximizing the log-likelihood is a 
convex optimization problem in $\theta$ for products, but not for sums. Second, this 
formalism allows clues to have positive, neutral, or even negative influence on the 
likelihood of a rule, based upon the sign of $\theta_i$. 
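
A minimal Python sketch of this log-linear model follows. It assumes a representation that is not part of the paper: each rule is a string of the form "LHS -> ...", and `suggestions` maps every rule to the list of indices of the clues that suggested it. Each rule's score is the exponential of the sum of its clues' parameters, normalized separately for every left-hand-side nonterminal.

```python
import math
from collections import defaultdict
from typing import Dict, List


def lhs(rule: str) -> str:
    """Nonterminal on the left-hand side of a rule written as 'LHS -> RHS'."""
    return rule.split("->")[0].strip()


def rule_probabilities(
    suggestions: Dict[str, List[int]], theta: List[float]
) -> Dict[str, float]:
    """Rule probabilities: exp(sum of suggesting clues' thetas), normalized per LHS."""
    scores = {
        rule: math.exp(sum(theta[i] for i in clue_ids))
        for rule, clue_ids in suggestions.items()
    }
    normalizer: Dict[str, float] = defaultdict(float)
    for rule, score in scores.items():
        normalizer[lhs(rule)] += score
    return {rule: score / normalizer[lhs(rule)] for rule, score in scores.items()}
```

With all parameters at zero every suggested rule for a given nonterminal is equally likely; raising one clue's parameter shifts probability mass towards the rules it suggests.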

3 System training and usage 

We are now ready to describe in full the operation of the training and inference phases. 

3.1 Training phase: learning $\theta$ 

At training time, we wish to learn the parameter $\theta$ that characterizes the 
conditional probability of a program given the input, $\Pr[f \mid z; \theta]$. We assume 
each training example $z^{(t)}$ is also annotated with the "correct" program $f^{(t)}$ 
that explains both the example and actual data pairs. We may attempt to discover these 
annotations automatically by bootstrapping: we start with a uniform parameter estimate 
$\theta^{(0)} = 0$. In iteration $j = 1, 2, 3, \ldots$, we select $f^{(j,t)}$ to be the 
most likely program, based on $\theta^{(j-1)}$, consistent with the system data. (If no 
program is found within the timeout, the example is ignored.) Then, parameters 
$\theta^{(j)}$ are learned, as described below. This is run until convergence. 

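Fix a single iteration $j$. For notational convenience, we write target programs 
$f^{(t)} = f^{(j,t)}$ and parameters $\theta = \theta^{(j)}$. We choose $\theta$ so as to 
minimize the negative log-likelihood of the data, plus a regularization term: 

    $$\theta = \operatorname*{argmin}_{\theta' \in \mathbb{R}^n} \; \sum_{t=1}^{T} -\log \Pr[f^{(t)} \mid z^{(t)}; \theta'] + \lambda \Omega(\theta'),$$

where $\Pr[f^{(t)} \mid z^{(t)}; \theta]$ is defined by equations (1) and (2), the 
regularizer $\Omega(\theta)$ is the $\ell_2$ norm $\frac{1}{2}\|\theta\|_2^2$, and 
$\lambda > 0$ is the regularization strength which may be chosen by cross-validation. If 
$f^{(t)}$ consists of rules $r_1^{(t)}, r_2^{(t)}, \ldots, r_{k_t}^{(t)}$ (possibly with 
repetition), then 

    $$\log \Pr[f^{(t)} \mid z^{(t)}; \theta] = \sum_{k=1}^{k_t} \Big( \sum_{i \,:\, r_k^{(t)} \in c_i(z^{(t)})} \theta_i - \log Z_{\mathrm{LHS}(r_k^{(t)})} \Big),$$

where $Z_{\mathrm{LHS}(r)}$ is the normalizer of equation (2) for the nonterminal on the 
left-hand side of rule $r$, evaluated on $z^{(t)}$. 

The convexity of the objective follows from the convexity of the regularizer and the 
log-sum-exp function. The parameters $\theta$ are optimized by gradient descent. 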
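
The optimization step can be sketched as plain batch gradient descent on the regularized negative log-likelihood. The sketch below is an assumption-laden illustration: the corpus format (a list of pairs of a rule-to-clue-indices map and the annotated program's rule list), the fixed step size, and the fixed number of steps are all hypothetical choices, and rules are strings of the form "LHS -> RHS". For one rule occurrence, the gradient with respect to a clue's parameter is minus one if that clue suggested the rule, plus the total probability of the same-nonterminal rules that the clue suggested.

```python
import math
from typing import Dict, List, Tuple

Task = Tuple[Dict[str, List[int]], List[str]]  # (rule -> suggesting clues, program rules)


def lhs(rule: str) -> str:
    return rule.split("->")[0].strip()


def rule_probs(suggestions: Dict[str, List[int]], theta: List[float]) -> Dict[str, float]:
    """Log-linear rule probabilities, normalized per left-hand-side nonterminal."""
    scores = {r: math.exp(sum(theta[i] for i in ids)) for r, ids in suggestions.items()}
    z: Dict[str, float] = {}
    for r, s in scores.items():
        z[lhs(r)] = z.get(lhs(r), 0.0) + s
    return {r: s / z[lhs(r)] for r, s in scores.items()}


def objective_and_grad(corpus: List[Task], theta: List[float], lam: float):
    """Negative log-likelihood plus (lam / 2) * ||theta||^2, and its gradient."""
    obj = 0.5 * lam * sum(t * t for t in theta)
    grad = [lam * t for t in theta]
    for suggestions, program in corpus:
        probs = rule_probs(suggestions, theta)
        for rule in program:  # repeated rules are counted once per occurrence
            obj -= math.log(probs[rule])
            for i in suggestions[rule]:
                grad[i] -= 1.0
            for other, ids in suggestions.items():
                if lhs(other) == lhs(rule):
                    for i in ids:
                        grad[i] += probs[other]
    return obj, grad


def fit(corpus: List[Task], n_clues: int, lam: float = 0.1,
        step: float = 0.1, n_steps: int = 500) -> List[float]:
    """Gradient descent from the uniform estimate theta = 0."""
    theta = [0.0] * n_clues
    for _ in range(n_steps):
        _, grad = objective_and_grad(corpus, theta, lam)
        theta = [t - step * g for t, g in zip(theta, grad)]
    return theta
```

In a bootstrapping loop, `fit` would be called once per iteration $j$ on the programs found with the previous parameters.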

3.2 Inference phase: evaluating on new input 

At inference time, we are given system input $z = (\bar{x}, x, y)$, $n$ clues 
$c_1, c_2, \ldots, c_n$, and parameters $\theta \in \mathbb{R}^n$ learned from the 
training phase. We are also given a timeout $\tau$. The goal is to infer the most likely 
program $\hat{f}$ that explains the data under a certain PCFG. This is done as follows: 

(i) We evaluate each clue on the system input $z$. The underlying PCFG $\mathcal{G}_z$ 
consists of the union of all suggested rules, $\mathcal{R}_z = \bigcup_{i=1}^{n} c_i(z)$. 

(ii) Probabilities are assigned to these rules via Equation (2), using the learned 
parameters $\theta$. 

(iii) We enumerate over $\mathcal{G}_z$ in order of decreasing probability, and return the 
first discovered $\hat{f}$ that explains the $(x, y)$ string transformation, or $\bot$ if 
we exceed the timeout. 

To find the most likely consistent program, we enumerate all programs of probability at 
least $\eta > 0$, for any given $\eta$. We begin with a large $\eta$, gradually decreasing 
it and testing all programs until we find one which outputs $y$ on $x$ (or we exceed the 
timeout $\tau$). (If more than one consistent program is found, we just select the most 
likely one.) Due to the exponentially increasing nature of the number of programs, this 
decreasing threshold approach imposes a negligible overhead due to redundancy: the vast 
majority of programs are executed just once. 

To compute all programs of probability at least $\eta$, a dynamic program first computes 
the maximal probability of a full trace from each nonterminal. Given these values, it is 
simple to compute the maximal probability completion of any partial trace. We then iterate 
over each nonterminal expansion, checking whether applying it can lead to any programs 
above the threshold; if so, we recurse. 
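
The thresholded enumeration can be sketched as follows. This is a minimal illustration, not the prototype's search: the grammar encoding (each nonterminal maps to a list of probability and token-list pairs), the halving schedule for the threshold, and the use of a minimum threshold `min_eta` in place of a wall-clock timeout are all assumptions, as are the names `max_completion`, `enumerate_programs` and `most_likely_consistent`.

```python
from typing import Callable, Dict, Iterator, List, Optional, Tuple

# Each nonterminal maps to rules (probability, right-hand side as a list of tokens).
Grammar = Dict[str, List[Tuple[float, List[str]]]]


def max_completion(grammar: Grammar) -> Dict[str, float]:
    """Maximal probability of a full derivation from each nonterminal."""
    best = {v: 0.0 for v in grammar}
    # A best derivation never needs to repeat a nonterminal along a path,
    # so len(grammar) sweeps suffice.
    for _ in range(len(grammar)):
        for v, rules in grammar.items():
            for p, rhs in rules:
                q = p
                for sym in rhs:
                    if sym in grammar:
                        q *= best[sym]
                best[v] = max(best[v], q)
    return best


def enumerate_programs(grammar: Grammar, start: str,
                       eta: float) -> Iterator[Tuple[float, str]]:
    """Yield (probability, program) for every program of probability at least eta."""
    best = max_completion(grammar)

    def bound(prob: float, form: List[str]) -> float:
        # Best achievable probability of any completion of this partial derivation.
        for sym in form:
            if sym in grammar:
                prob *= best[sym]
        return prob

    def expand(prob: float, form: List[str]) -> Iterator[Tuple[float, str]]:
        for k, sym in enumerate(form):
            if sym in grammar:  # expand the leftmost nonterminal
                for p, rhs in grammar[sym]:
                    new_form = form[:k] + rhs + form[k + 1:]
                    if bound(prob * p, new_form) >= eta:
                        yield from expand(prob * p, new_form)
                return
        yield prob, "".join(form)  # no nonterminals left: a complete program

    if bound(1.0, [start]) >= eta:
        yield from expand(1.0, [start])


def most_likely_consistent(grammar: Grammar, start: str,
                           consistent: Callable[[str], bool],
                           eta: float = 0.5, min_eta: float = 1e-6) -> Optional[str]:
    """Lower the threshold until some enumerated program passes `consistent`."""
    while eta >= min_eta:
        hits = [(p, prog) for p, prog in enumerate_programs(grammar, start, eta)
                if consistent(prog)]
        if hits:
            return max(hits)[1]  # most likely consistent program
        eta /= 2.0
    return None
```

In a full system, `consistent` would execute the candidate program on $x$ and compare the result with $y$; here it is left as a caller-supplied predicate.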
4 Results on prototype system 



To test the efficacy of our proposed system, we report results on a prototype web app 
implemented using client-side JavaScript and executed in a web browser on an Intel 
Core i7 920 processor. Our goal with the experiments is not to claim that our prototype 
system is "better" than existing systems in terms of functionality or richness. (Even 
if we wished to compare functionality, this would be difficult since all existing text 
processing systems that we are aware of are proprietary.) Instead, our aim is to evaluate 
whether learning weights using textual features - which has not been studied in any 
prior system, to our knowledge - can speed up inference. Nonetheless, we do attempt 
to construct a reasonably functional system so that our results can be indicative of what 
we might expect to see in a real-world text processing system. 

4.1 Details of base functions and clues 

As discussed in Section 2.2, we associated the rules in our PCFG with a set of base 
functions. In total we created around 100 functions, such as dedup, concatLists, and 
count, as described in Section 1.1. For clues to connect these functions to features of 
the examples, we had one set of base clues that suggested functions we believed to be 
common, regardless of the system input $z$ (e.g. string concatenation). Other clues were 
designed to support common formats that we expected, such as dates, tabular and delimited 
data. Table 2 gives a sample of some of the clues in our system, in the form of grammar 
rules that certain textual features are connected to; in total we had approximately 100 
clues. The full list of functions and clues is available as part of our supplementary 
material. 



Table 2: Sample of clues used. LIST denotes a list-, E a string-nonterminal. 

    Feature                                            Suggested rule(s) 
    -------------------------------------------------  --------------------- 
    Substring s appears in output but not input?       E → "s", LIST → {E} 
    Duplicates in input but not output?                LIST → dedup(LIST) 
    Numbers on each input line but not output line?    LIST → count(LIST) 



4.2 Training set for learning 

To evaluate the system, we compiled a set of 280 examples with both an example pair 
$(x, y)$ and evaluation pair $(\bar{x}, \bar{y})$ specified. These examples were partly 
hand-crafted, based on various common usage scenarios the authors have encountered, and 
partly based on examples used in [3]. All examples are expressible as (possibly deep) 
compositions of our base functions; the median depth of composition is around 4. Like any 
classical learning model, we assume these are iid samples from the distribution of 
interest, namely over "natural" text processing examples. It is hard to justify this 
independence assumption in our case, but we are not aware of a good solution to this 
problem in general; even examples collected from a user study, say, will tend to be biased 
in some way. Table 3 gives a sample of some of the scenarios we tested the system on. To 
encourage future research on this problem, our suite of training examples is ready for 
public release, and is available as part of our supplementary material. 

Table 3: Sample of test-cases used to evaluate the system. 

    Input                                Output 
    -----------------------------------  --------------------------- 
    Adam Ant\n1 Ray Rd.\nMA\n90113       90113 
    28/5/2010                            June the 28th 2010 
    612 Australia                        case 612: return Australia; 



4.3 Does learning help? 

The learning procedure aims to allow us to find the correct program in the shortest 
amount of time. We compare this method to a baseline, hoping to see quantifiable 
improvements in performance. 

Baseline. Our baseline is to search through the grammar in order of increasing program 
size, attempting to find the shortest grammar derivation that explains the 
transformation. The grammar does use clues to winnow down the set of relevant rules, but 
does not use learned weights: we let $\theta_i = 0$ for all $i$, i.e. all rules that are 
suggested by a clue have the same constant probability. This method's performance lets us 
measure the impact of learning. Note that pure brute force search would not even use clues 
to narrow down the set of feasible grammar rules, and so would perform strictly worse. 
Such a method is infeasible for the tasks we consider, because some of them involve e.g. 
constant strings, which cannot be enumerated. 

Measuring performance. To assess a method, we look at its accuracy, as mea- 
sured by the fraction of correctly discovered programs, and efficiency, as measured by 
the time required for inference. As every target program in the training set is express- 
ible as a composition of our base functions, there are two ways in which we might fail 
to infer the correct program: (a) the program is not discoverable within the timeout set 
for the search, or (b) another program (one which also explains the example transfor- 
mation) is wrongly given a higher probability. We call errors of type (a) timeout errors, 
and errors of type (b) ranking errors. Larger timeouts lead to fewer timeout errors. 

Evaluation scheme. One possible pitfall in an empirical evaluation is having an overly 
specific set of clues for the training set: an extreme case would be a single clue for 
each training example, which automatically suggested the correct rules to compose. To 
ensure that the system is capable of making useful predictions on new data, we report the 
test error after creating 10 random 80-20 splits of the training set. For each split, we 
compare the various methods as the inference timeout $\tau$ varies from 
{1/16, 1/8, ..., 16} seconds. For the learning method, we performed 3 bootstrap 
iterations (see Section 3.1) with a timeout of 8 seconds to get annotations for each 
training example. 
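
The evaluation protocol can be sketched in Python as below. This is only an illustration of the bookkeeping: the helper names `random_splits` and `classify_outcome`, the fixed random seed, and the convention that a timed-out search returns None are assumptions, and the timeout grid takes the values between 1/8 and 16 to be successive powers of two.

```python
import random
from typing import List, Optional, Tuple

# Inference timeouts in seconds: 1/16, 1/8, ..., 16 (assumed powers of two).
TIMEOUTS = [2.0 ** k for k in range(-4, 5)]


def random_splits(n_examples: int, n_splits: int = 10, train_frac: float = 0.8,
                  seed: int = 0) -> List[Tuple[List[int], List[int]]]:
    """Random train/test splits over example indices (80-20 by default)."""
    rng = random.Random(seed)
    cut = int(round(train_frac * n_examples))
    splits = []
    for _ in range(n_splits):
        idx = list(range(n_examples))
        rng.shuffle(idx)
        splits.append((idx[:cut], idx[cut:]))
    return splits


def classify_outcome(predicted: Optional[str], correct: str) -> str:
    """'timeout' if the search returned nothing, 'ranking' if it returned a wrong output."""
    if predicted is None:
        return "timeout"
    return "correct" if predicted == correct else "ranking"
```

For the 280 examples used here, each split has 224 training and 56 test examples; timeout and ranking error rates are then averaged over the 10 splits for each timeout in `TIMEOUTS`.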



Results. Figures 2(a) and 2(b) plot the timeout and ranking error rates respectively. As 
expected, for both methods, most errors arise due to timeout when $\tau$ is small. To 
achieve the same timeout error rate, learning saves about two orders of magnitude in 
$\tau$ compared to the baseline. Learning also achieves lower mean ranking error, but this 
difference is not as pronounced as for timeout errors. This is not surprising, because the 
baseline generally finds few candidates in the first place (recall that the ranking error 
is only measured on examples that do not time out); by contrast, the learning method opens 
the space of plausible candidates, but introduces a risk of some of them being incorrect. 

Figure 2: Comparison of baseline versus learning approach. Panels: (a) Timeout errors, 
(b) Ranking errors, and (c) Mean speedup due to learning, each plotted against inference 
timeout (secs); (d) Scatterplot of prediction times, plotted against baseline inference 
time (secs). 


Figure 2(c) shows the relative speedup due to learning as $\tau$ varies. We see that 
learning manages to cut down the prediction time by a factor of almost 40 over the 
baseline with $\tau = 16$ seconds. (The system would be even faster if implemented in a 
low-level programming language such as C instead of JavaScript.) The trend of the curve 
suggests there are examples that the baseline is unable to discover with 16 seconds, but 
learning discovers with far fewer. Figure 2(d), a scatterplot of the times taken for both 
methods with $\tau = 16$ over all 10 train-test splits, confirms this: in the majority of 
cases, learning finds a solution in much less time than the baseline, and solves many 
examples the baseline fails on within a fraction of a second. (In some cases, learning 
slightly increases inference time. Here, the test example involves functions 
insufficiently represented in the training set.) 

Finally, Figure 3 compares the depths of programs (i.e. number of constituent grammar 
rules) discovered by learning and the baseline over all 10 train-test splits, with an 
inference timeout of $\tau = 16$ seconds. As expected, the learning procedure discovers 
many more programs that involve deep (depth > 4) compositions of rules, since the rules 
that are relevant are given higher probability. 

Figure 3: Learnt program depths, $\tau = 16$s. "N/A" denotes that no successful program 
is found. 

5 Conclusion and future work 

We propose a PBE system for repetitive text processing based on exploiting certain 
clues in the input data. We show how one can learn the utility of clues, which relate 
textual features to rules in a context free grammar. This allows us to speed up the search 
process, and obtain a meaningful ranking over programs. Experiments on a prototype 
system show that learning with clues brings significant savings over naive brute force 
search. As future work, it would be interesting to learn correlations between rules 
and clues that did not suggest them, although this would necessitate enforcing some 
strong parameter sparsity. It would also be interesting to incorporate ideas like adaptor 
grammars [6] and learning program structure as in [?]. 